linguistic context
Detecting Referring Expressions in Visually Grounded Dialogue with Autoregressive Language Models
Willemsen, Bram, Skantze, Gabriel
In this paper, we explore the use of a text-only, autoregressive language modeling approach for the extraction of referring expressions from visually grounded dialogue. More specifically, the aim is to investigate the extent to which the linguistic context alone can inform the detection of mentions that have a (visually perceivable) referent in the visual context of the conversation. To this end, we adapt a pretrained large language model (LLM) to perform a relatively coarse-grained annotation of mention spans in unfolding conversations by demarcating mention span boundaries in text via next-token prediction. Our findings indicate that even when using a moderately sized LLM, relatively small datasets, and parameter-efficient fine-tuning, a text-only approach can be effective, highlighting the relative importance of the linguistic context for this task. Nevertheless, we argue that the task represents an inherently multimodal problem and discuss limitations fundamental to unimodal approaches.
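To illustrate what demarcating mention span boundaries in text can look like downstream, here is a minimal sketch that recovers character-level spans from a string in which a model has inserted boundary markers. The markers `<m>` and `</m>` and the function name `extract_mentions` are assumptions made for illustration; the abstract does not say which boundary tokens the authors use.

```python
import re

# Hypothetical boundary markers; the abstract does not name the actual tokens.
OPEN, CLOSE = "<m>", "</m>"


def extract_mentions(annotated):
    """Strip boundary markers and return (plain_text, spans).

    Each span is a (start, end) pair of character offsets into plain_text.
    """
    pattern = re.compile(re.escape(OPEN) + r"(.*?)" + re.escape(CLOSE), re.DOTALL)
    plain_parts = []
    spans = []
    cursor = 0  # read position in the annotated string
    offset = 0  # length of the plain text built so far
    for match in pattern.finditer(annotated):
        # Copy the unmarked text that precedes this mention.
        before = annotated[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        # Record the mention's offsets in the marker-free text.
        mention = match.group(1)
        spans.append((offset, offset + len(mention)))
        plain_parts.append(mention)
        offset += len(mention)
        cursor = match.end()
    plain_parts.append(annotated[cursor:])
    return "".join(plain_parts), spans


text, spans = extract_mentions("I see <m>the red cup</m> next to <m>a lamp</m>.")
mentions = [text[start:end] for start, end in spans]
```

The sketch handles only flat, non-nested spans, in keeping with the coarse-grained annotation the abstract describes.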
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.14)
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.14)
The time scale of redundancy between prosody and linguistic context
Regev, Tamar I., Ohams, Chiebuka, Xie, Shaylee, Wolf, Lukas, Fedorenko, Evelina, Warstadt, Alex, Wilcox, Ethan G., Pimentel, Tiago
In spoken language, speakers transmit information not only using words, but also via a rich array of non-verbal signals, which include prosody -- the auditory features of speech. However, previous studies have shown that prosodic features exhibit significant redundancy with both past and future words. Here, we examine the time scale of this relationship: How many words in the past (or future) contribute to predicting prosody? We find that this scale differs for past and future words. Prosody's redundancy with past words extends across approximately 3-8 words, whereas redundancy with future words is limited to just 1-2 words. These findings indicate that the prosody-future relationship reflects local word dependencies or short-scale processes such as next word prediction, while the prosody-past relationship unfolds over a longer time scale. The latter suggests that prosody serves to emphasize earlier information that may be challenging for listeners to process given limited cognitive resources in real-time communication. Our results highlight the role of prosody in shaping efficient communication.
- North America > Canada > Quebec > Montreal (0.04)
- Asia > Singapore (0.04)
- North America > United States > Pennsylvania > Philadelphia County > Philadelphia (0.04)
Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition
Rousset, Thibault, Kakibuchi, Taisei, Sasaki, Yusuke, Nomura, Yoshihide
This paper investigates the integration of technical vocabulary in merged language models. We explore the knowledge transfer mechanisms involved when combining a general-purpose language-specific model with a domain-specific model, focusing on the resulting model's comprehension of technical jargon. Our experiments analyze the impact of this merging process on the target model's proficiency in handling specialized terminology. We present a quantitative evaluation of the performance of the merged model, comparing it with that of the individual constituent models. The findings offer insights into the effectiveness of different model merging methods for enhancing domain-specific knowledge and highlight potential challenges and future directions in leveraging these methods for cross-lingual knowledge transfer in Natural Language Processing.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- Information Technology > Knowledge Management > Knowledge Engineering (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)
Referring Expression Generation in Visually Grounded Dialogue with Discourse-aware Comprehension Guiding
Willemsen, Bram, Skantze, Gabriel
We propose an approach to referring expression generation (REG) in visually grounded dialogue that is meant to produce referring expressions (REs) that are both discriminative and discourse-appropriate. Our method constitutes a two-stage process. First, we model REG as a text- and image-conditioned next-token prediction task. REs are autoregressively generated based on their preceding linguistic context and a visual representation of the referent. Second, we propose the use of discourse-aware comprehension guiding as part of a generate-and-rerank strategy through which candidate REs generated with our REG model are reranked based on their discourse-dependent discriminatory power. Results from our human evaluation indicate that our proposed two-stage approach is effective in producing discriminative REs, with higher performance in terms of text-image retrieval accuracy for reranked REs compared to those generated using greedy decoding.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Texas > Travis County > Austin (0.04)
- North America > United States > Pennsylvania (0.04)
Semantics or spelling? Probing contextual word embeddings with orthographic noise
Matthews, Jacob A., Starr, John R., van Schijndel, Marten
Pretrained language model (PLM) hidden states are frequently employed as contextual word embeddings (CWE): high-dimensional representations that encode semantic information given linguistic context. Across many areas of computational linguistics research, similarity between CWEs is interpreted as semantic similarity. However, it remains unclear exactly what information is encoded in PLM hidden states. We investigate this practice by probing PLM representations using minimal orthographic noise. We expect that if CWEs primarily encode semantic information, a single character swap in the input word will not drastically affect the resulting representation, given sufficient linguistic context. Surprisingly, we find that CWEs generated by popular PLMs are highly sensitive to noise in input data, and that this sensitivity is related to subword tokenization: the fewer tokens used to represent a word at input, the more sensitive its corresponding CWE. This suggests that CWEs capture information unrelated to word-level meaning and can be manipulated through trivial modifications of input data. We conclude that these PLM-derived CWEs may not be reliable semantic proxies, and that caution is warranted when interpreting representational similarity.
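As a rough illustration of the kind of minimal orthographic noise described above, the sketch below perturbs a word by swapping one pair of adjacent characters. Reading "a single character swap" as an adjacent transposition is an assumption, as is the helper name `swap_adjacent_chars`; the abstract does not specify the exact noising procedure.

```python
import random


def swap_adjacent_chars(word, rng):
    """Return the word with one randomly chosen pair of adjacent characters swapped.

    Pairs of identical characters are skipped, since swapping them would leave
    the word unchanged. Words with no swappable pair are returned as-is.
    """
    candidates = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    if not candidates:
        return word
    i = rng.choice(candidates)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


rng = random.Random(0)  # seeded so the perturbation is reproducible
noisy = swap_adjacent_chars("context", rng)
```

A probing setup along the lines of the abstract would then compare the embedding of the original word with that of `noisy` in the same sentence; that comparison is not shown here.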
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Asia > China > Hong Kong (0.04)
- North America > Dominican Republic (0.04)
Reinterpreting 'the Company a Word Keeps': Towards Explainable and Ontologically Grounded Language Models
We argue that the relative success of large language models (LLMs) is not a reflection on the symbolic vs. subsymbolic debate but a reflection on employing a successful bottom-up strategy of reverse engineering language at scale. However, due to their subsymbolic nature, whatever knowledge these systems acquire about language will always be buried in millions of weights, none of which is meaningful on its own, rendering such systems utterly unexplainable. Furthermore, due to their stochastic nature, LLMs will often fail to make the correct inferences in linguistic contexts that require reasoning in intensional, temporal, or modal settings. To remedy these shortcomings, we suggest employing the same successful bottom-up strategy employed in LLMs but in a symbolic setting, resulting in explainable, language-agnostic, and ontologically grounded language models.
- Europe > Spain > Galicia > Madrid (0.05)
- North America > United States > Minnesota (0.05)
- North America > United States > Illinois > Cook County > Chicago (0.04)
GeniL: A Multilingual Dataset on Generalizing Language
Davani, Aida Mostafazadeh, Gubbi, Sagar, Dev, Sunipa, Dave, Shachi, Prabhakaran, Vinodkumar
LLMs are increasingly transforming our digital ecosystem, but they often inherit societal biases learned from their training data, for instance stereotypes associating certain attributes with specific identity groups. While whether and how these biases are mitigated may depend on the specific use cases, being able to effectively detect instances of stereotype perpetuation is a crucial first step. Current methods to assess presence of stereotypes in generated language rely on simple template or co-occurrence based measures, without accounting for the variety of sentential contexts they manifest in. We argue that understanding the sentential context is crucial for detecting instances of generalization. We distinguish two types of generalizations: (1) language that merely mentions the presence of a generalization ("people think the French are very rude"), and (2) language that reinforces such a generalization ("as French they must be rude"), from non-generalizing context ("My French friends think I am rude"). For meaningful stereotype evaluations, we need to reliably distinguish such instances of generalizations. We introduce the new task of detecting generalization in language, and build GeniL, a multilingual dataset of over 50K sentences from 9 languages (English, Arabic, Bengali, Spanish, French, Hindi, Indonesian, Malay, and Portuguese) annotated for instances of generalizations. We demonstrate that the likelihood of a co-occurrence being an instance of generalization is usually low, and varies across different languages, identity groups, and attributes. We build classifiers to detect generalization in language with an overall PR-AUC of 58.7, with varying degrees of performance across languages. Our research provides data and tools to enable a nuanced understanding of stereotype perpetuation, a crucial step towards more inclusive and responsible language technologies.
- Asia > India (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- South America > Brazil (0.04)
Multidimensional Service Quality Scoring System
This supplementary material aims to introduce the Multidimensional Service Quality Scoring System (MSQs), a review-based method for quantifying host service quality mentioned and employed in the paper "Exit and transition: Exploring the survival status of Airbnb listings in a time of professionalization". The framework presented in Figure 1 summarizes the workflow of MSQs. MSQs is not an end-to-end implementation and is essentially composed of three pipelines, namely Data Collection and Preprocessing, Objects Recognition and Grouping, and Aspect-based Service Scoring. Using the study mentioned above as a case, the details of MSQs are explained below.
Data Collection and Preprocessing
This pipeline can be further divided into three substeps.
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > District of Columbia > Washington (0.04)
- Asia > China > Beijing > Beijing (0.04)
Detect Only What You Specify : Object Detection with Linguistic Target
Object detection is the computer vision task of predicting a set of bounding boxes and category labels for each object of interest in a given image. Each category is related to a linguistic symbol such as 'dog' or 'person', and there should be relationships among them. However, the object detector only learns to classify the categories and does not treat them as linguistic symbols. Multi-modal models often use a pre-trained object detector to extract object features from the image, but the models are separate from the detector and the extracted visual features do not change with their linguistic input. We rethink object detection as a vision-and-language reasoning task. We then propose the targeted detection task, where detection targets are given in natural language and the goal is to detect all of the target objects, and only those, in a given image. Nothing is detected if no target is given. Commonly used modern object detectors have many hand-designed components such as anchors, and it is difficult to fuse textual inputs into such a complex pipeline. We thus propose the Language-Targeted Detector (LTD) for targeted detection, based on a recently proposed Transformer-based detector. LTD is an encoder-decoder architecture, and our conditional decoder allows the model to reason about the encoded image with the textual input as the linguistic context. We evaluate the detection performance of LTD on the COCO object detection dataset and also show that our model improves the detection results with the textual input grounded to the visual object.
- Asia > Middle East > Qatar > Ad-Dawhah > Doha (0.04)
- Asia > Japan > Honshū > Kantō > Kanagawa Prefecture (0.04)
Wanderley
Dialogue systems are intended to facilitate the interaction between humans and computers. A key element in a dialogue system is the conceptual model, which represents a domain. Folksonomies are very simple forms of knowledge representation that may be used to specify the conceptual model. However, folksonomies suffer by nature from issues related to ambiguity. In this paper, we present a method that uses linguistic context for learning folksonomies from task-oriented dialogues. The linguistic context can be useful for reducing ambiguity, for instance, when using the folksonomies for interpreting utterances. Experiments show that the learned folksonomies increase the accuracy of the interpretation compared with not using the contextual information.